Schooling, an archetype of collective behavior, emerges from the interactions of fish responding
to visual and other informative cues mediated by their aqueous environment. In this context, a
fundamental and largely unexplored question concerns the role of hydrodynamics. Here, we investigate
schooling by modeling swimmers as vortex dipoles whose interactions are governed by the
Biot-Savart law. When we enhance these dipoles with behavioral rules from classical agent based
models, we find that, because of flow-mediated interactions, they do not lead robustly to schooling. In
turn, we present dipole swimmers equipped with adaptive decision-making that learn, through a
reinforcement learning algorithm, to adjust their gaits in response to non-linearly varying hydrodynamic
loads. The dipoles maintain their relative position within a formation by adapting their
strength, and school in a variety of prescribed geometrical arrangements. Furthermore, we identify
schooling patterns that minimize the individual and the collective swimming effort, through
an evolutionary optimization. The present work suggests that the adaptive response of individual
swimmers to flow-mediated interactions is critical in fish schooling.
I. INTRODUCTION


Schooling, encountered in over ten thousand species [1], is believed to provide several advantages to fish [2] including
protection and defense against predators [3, 4], enhanced foraging [5] and mating success [6]. It is also plausible that
fish benefit from increased hydrodynamic efficiency [7]. Understanding the governing mechanisms in fish schooling 
and exploiting them for rational engineering designs [8] requires that we elucidate the interplay between social and 
hydrodynamic interactions among swimmers. 

While such distinctions may be difficult to investigate in experimental or natural settings, the detailed information
that can be obtained via simulations is invaluable. At the same time, while schooling can be readily observed in natural
and experimental settings, in simulations it is essential to equip the individuals with an appropriate behavioral model 
to achieve such group dynamics. Agent based models [9] that lead to schooling or flocking rely on local interaction 
rules handcrafted a priori based on empirical arguments and experimental observations pUUTS] . These models have 
been a key tool in helping to understand the influence of social traits in the emergence of schooling patterns [HHn]. 
However, they do not explicitly account for the flow environment. We consider this a limitation especially in the case 
of large, tightly packed fish assemblies. In fact, a natural swimmer which wishes to adapt its speed and orientation 
to satisfy local interaction rules (e.g. move with the average velocity of its neighbors) needs to translate this into 
specific body gaits. These actions perturb the flow field, which in turn affects the dynamics of the neighbors. 

It is also important to distinguish between self-propelled swimmers and swimmers that are towed with a specified 
velocity through the flow field m, as it is usually implied in agent based models. In the case of a towed swimmer, if 
hydrodynamics is included it only affects the energy expenditure for towing the swimmer with the specified velocity, 
while it does not influence its dynamics nor its trajectory. A self-propelled swimmer instead has to adjust its gait 
to compensate for non-linearly varying hydrodynamic loads to propel itself in a desired direction. As fish rely on 
self-propulsion, it is essential to capture this trait altogether with the long range fluid coupling. To the best of our 
knowledge, such hydrodynamic interactions have not been included in agent based models of swimming. Hence, 
fundamental questions on how fish respond to each other’s wakes, and to what extent schooling is the result of their
synthesized vortex field or of their social traits, remain largely unanswered.

Swimmers influence their flow environment which in turn affects the dynamics of the individual at all scales. In 
the Stokes flow regime, it has been noted that the collective motion of microorganisms induces flow coupling that 
leads to transitions from ordered to disordered patterns [TqHSS]- Both in the inviscid limit and at finite Reynolds 
numbers, recent works have demonstrated that specific body motions can propel initially stationary neighbors [23, 24],
while models of rotating discs at finite Reynolds numbers have been shown to lead to the emergence of patterns [25] . 
Experimental observations indicate that some fish species arrange themselves in diagonal formations Eam] and it 
has been suggested (T] |28l El] that fish in diamond configurations can exploit the vorticity created by their neighbors 
to decrease their energy expenditure. This hypothesis relies on stable, periodic fish arrangements, prescribed gait 
and unperturbed or minimally perturbed flow conditions. At intermediate and large Reynolds numbers the flow 
field synthesized by the vorticity shed by multiple swimmers [30-33] is noisy, and varying loads are induced on the
swimmers depending on their relative location [34]. How can swimmers overcome this noisy environment to achieve 
specific behavioral or physical goals? To what extent do flow-mediated interactions affect decision making and group
behavior?

In this article we investigate the effect of flow-mediated interactions on the internal structure and global shape of 
schools composed of hundreds of model dipole swimmers. This work is inspired by the concept of Vortobots [35]. The
Vortobots were envisioned as simplified rotating bodies (vortices) that move in swarms by controlled hydrodynamic 
interactions. Here, following Tchieu et al. [36], swimmers are modeled as self-propelled, finite width dipoles capable 
of accelerating, decelerating and turning. We show that the use of non-adaptive a priori defined local interaction 
rules does not robustly allow swimmers to maintain finite size schooling formations, causing them to diverge from one 
another or to collide. In turn we show that swimmers can learn, through a reinforcement learning (RL) algorithm [37],
to dynamically adjust their swimming actions in response to flow-mediated interactions so as to swim in arbitrary finite 
size schooling arrangements. Furthermore, we identify schooling arrangements that minimize collective swimming 
effort, via an evolutionary optimization technique. Finally, the relative effort of swimmers distributed within an 
optimal school is investigated. 

Our study highlights the importance of accounting for the hydrodynamic environment in collective dynamics, and
outlines a rigorous methodology for identifying optimal adaptive action policies so as to respond to flow-mediated
interactions.
FIG. 1. Schematic of reinforcement learning coupled with a low-order model for swimmers in an inviscid flow. The goal of the
$n$th agent is encoded into the numerical reward $r_n$ and the agent learns, through a trial and error process called reinforcement
learning [37], how to map states $s_n$ into actions $a_n$ to maximize the long term reward (Section II D). The system is simulated
via a low-order model that takes into account swimmer-swimmer dynamics mediated by an inviscid fluid medium [36].


II. LEARNING OPTIMAL BEHAVIOR IN A FLUID-MEDIATED ENVIRONMENT 

We examine the collective behavior of model dipole swimmers. In order to control their velocity and bearing, the 
swimmers can adapt their dipole strengths and as such they affect the environment and, in turn, the dynamics of 
all other swimmers. In contrast to classical agent based models [HHE], besides including hydrodynamics, we do 
not specify a priori local interaction rules. Instead, these are automatically identified by a reinforcement learning 
algorithm. 

The dynamics of a system of N swimmers immersed in an inviscid flow is represented as a low-order model denoted 
as ‘finite width dipole’ [36]. The goal of the swimmers is to swim coherently in a prescribed formation, avoiding 
collisions or dispersion. This is a necessary preliminary step to allow for the hydrodynamic characterization of
different swimming formations in terms of energetic expenditure (Section II E). Given the swimmers' repertoire of
possible actions and sensorial representation of the environment (denoted as states), reinforcement learning [37]
allows them, through trial and error, to discover an optimal behavioral policy (i.e. a mapping between states and
actions) to maintain their relative positions within the school. Each loop in Fig. 1 represents a single learning instance
where all agents use their learned policy to select an action which alters their state through the modeled dynamics. 
The reward associated with the new state aids the agents in improving their policy, which eventually converges to an 
optimal policy. 


A. A finite width dipole model for hydrodynamically interacting swimmers 

The flow field generated by individual natural swimmers possesses a complex signature that is greatly affected by 
their gait, morphology and size as well as by viscous and three-dimensional effects. The characterization of group 
dynamics of hundreds of swimming bodies that resolves this level of detail is to date computationally beyond reach. 
Therefore, we study swimmers modeled as finite self-propelling dipoles (Fig. 2a-b) immersed in an inviscid, unbounded
and incompressible flow [36]. This model reflects the fact that the far field associated with a self-propelled undulating
body is dipolar, to leading order [38]. The finite dipole model represents a drastic idealization of a swimmer since it 
abstracts from morphological and kinematic traits, it is massless and therefore disregards the inertia of a solid body 
and does not account for three-dimensional and viscous effects, such as separation and vortex shedding. Nevertheless, 
it does capture at first order the flow coupling among self-propelled bodies sufficiently spaced apart (more than one 
body length as estimated in [36]), and it is computationally effective. Bearing in mind its limitations, the dipole 
model lends itself to qualitative computational inquiries into fish schooling. Finally, we reiterate that
self-propelled agents are distinct from agents that are towed with a certain velocity. In agent based models, the latter 
are usually employed, but these do not correspond to self-propelled animals responding to non-linear interactions with 
the flow field. 

The basis of this low-order model is depicted in Fig. 2. In a system of $N$ dipole swimmers, a dipole located at $\mathbf{x}_n$
($n = 1, 2, \dots, N$) is decomposed into two vortices located at $\mathbf{x}_n^l$ and $\mathbf{x}_n^r$ with circulation strengths $\Gamma_n^l$ and $\Gamma_n^r$, separated
FIG. 2. Streamlines of (a) inviscid swimmers translating in a potential flow and (b) the finite dipole approximation of the
swimmers [36]. Given in (c) is the detailed view of a single finite dipole; (d) illustrates the state representation of an individual
dipole swimmer attempting to follow a translating lattice point (Section II D).


by a constant distance $\ell$. The dipole swimmer travels with a bearing defined by $\alpha_n$, as depicted in Fig. 2c. Each
finite dipole swimmer can change its vortex strengths as a means of controlling its bearing and speed. Following [36], 
the equations of motion of N self-propelled interacting finite dipoles are modified to allow each dipole to change its 
individual bearing and speed while simultaneously affecting the flow. Note that this model is different from the one 
used in [39] in that the generated flow field affects the bearing of the swimmers and the swimmers directly change
the flow field when performing actions. We also note that the value $\ell$ can be related to a characteristic width of the
swimmer $D$ by matching the far-field dipolar strength of a body moving in an inviscid fluid to that of a finite dipole,
which fixes $\ell$ in proportion to $D$. To proceed, a point $\mathbf{x}$ is mapped to the complex $z$-plane such that $\mathbf{x} = (x, y) \mapsto z = x + iy$,
where $i = \sqrt{-1}$. Therefore, given the position of a dipole $\mathbf{x}_n \mapsto z_n$, its two vortices of strengths $\Gamma_n^l$ and $\Gamma_n^r$, separated
by a constant distance $\ell$, are located at (Fig. 2c)
is the interaction term due to all other dipoles in the environment. We emphasize that the first terms of Eq. (2a) and
Eq. (2b) correspond to the dipole self-induced velocity and bearing rate when no other swimmers or background flow
are present. Therefore, these terms relate to the ability of an agent to affect its own speed and bearing.

We wish to stress the fact that the dipole model allows us to evolve the system in time by actually solving the Euler
equations for an incompressible, inviscid flow, and therefore the presence of a liquid environment is not simplistically
modeled through ad hoc local interaction rules. The major advantage is that this formulation provides a neat
distinction between social and hydrodynamic effects, unlike previous modeling approaches.
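As a rough illustration of how a vortex pair propels itself, the sketch below evaluates the velocity that each vortex of an isolated dipole induces on its partner, using the standard point-vortex form of the Biot-Savart law. The function names, the placement of the two vortices at $\pm (i\ell/2)\,e^{i\alpha}$ about the dipole center, and the sign convention (left vortex counterclockwise, right vortex clockwise) are assumptions made for this example; they are not taken from the governing equations of the model.

```python
import cmath
import math


def vortex_velocity(z, z_k, gamma_k):
    """Velocity u + i*v induced at z by a point vortex of circulation gamma_k at z_k."""
    conj_velocity = gamma_k / (2j * math.pi * (z - z_k))  # this is u - i*v
    return conj_velocity.conjugate()


def isolated_dipole_velocity(z, alpha, ell, gamma_l, gamma_r):
    """Mean velocity of the two vortices of a single dipole with no neighbors.

    Assumed convention: the left vortex carries +gamma_l (counterclockwise)
    and the right vortex carries -gamma_r (clockwise).
    """
    offset = 0.5j * ell * cmath.exp(1j * alpha)
    z_left, z_right = z + offset, z - offset
    v_left = vortex_velocity(z_left, z_right, -gamma_r)   # induced by right vortex
    v_right = vortex_velocity(z_right, z_left, gamma_l)   # induced by left vortex
    return 0.5 * (v_left + v_right)
```

With equal strengths $\Gamma$ on both vortices, this pair translates along its bearing at speed $\Gamma/(2\pi\ell)$; unequal strengths make the two vortices move at different speeds, which is what turns the dipole.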


B. Swimming gaits and maneuvering through circulation change 

We extend the original finite dipole model to equip each swimmer with a set of gaits or actions, as depicted in
Fig. 3. It can travel forward at three distinct speeds, $v^0$ (nominal speed), $v^-$ (slow), or $v^+$ (fast), or turn left
or right with turn radius $\rho^T$ while traveling at speed $v^0$. These actions are realized by allowing
FIG. 3. The set of actions available to each dipole swimmer (represented by triangles of width $D$ pointing in the
direction of travel). These actions are: (a) traveling straight at its nominal speed $v^0$, (b) traveling straight at a slower speed
$v^-$, (c) traveling straight at a faster speed $v^+$, (d) making a left turn, and (e) making a right turn at a specified turn radius $\rho^T$. In the
figures we also show the streamlines demonstrating how a swimmer affects the background flow field in the absence of all other
dipole swimmers. The actions are mapped to the integer set $a_n = \{1, 2, 3, 4, 5\}$, respectively.


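each dipole swimmer to instantaneously adjust its vortex circulations $\Gamma_n^l$ and $\Gamma_n^r$. The five actions, mapped to integer
values, adjust the circulation according to the rule

$$
\left\{
\begin{array}{lll}
a_n = 1: & \text{travel straight at } v^0, & \Gamma_n^l = \Gamma_n^r = \Gamma^0 \\
a_n = 2: & \text{travel straight at } v^-, & \Gamma_n^l = \Gamma_n^r = \Gamma^0 - \Gamma^A \\
a_n = 3: & \text{travel straight at } v^+, & \Gamma_n^l = \Gamma_n^r = \Gamma^0 + \Gamma^A \\
a_n = 4: & \text{turn left with radius } \rho^T, & \Gamma_n^l = \Gamma^0 + \Gamma^T, \quad \Gamma_n^r = \Gamma^0 - \Gamma^T \\
a_n = 5: & \text{turn right with radius } \rho^T, & \Gamma_n^l = \Gamma^0 - \Gamma^T, \quad \Gamma_n^r = \Gamma^0 + \Gamma^T
\end{array}
\right.
\tag{4}
$$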


where $\Gamma^0, \Gamma^A, \Gamma^T > 0$. The nominal vortex strength $\Gamma^0$ is related to the cruise velocity $v^0$ and characteristic size $\ell$
such that $\Gamma^0 = 2\pi \ell v^0$.

The additional circulations $\pm\Gamma^A$ and $\pm\Gamma^T$ due to traveling fast, slow, or turning right or left, respectively, are fixed
by the swimming parameters ($v^{\pm}$ and $\rho^T$). These values are related to the nominal circulation by

$$
\Gamma^A = \ldots \tag{5a}
$$

$$
\Gamma^T = \ldots \tag{5b}
$$


The change in circulation in turn modifies the flow field and thus influences all swimmers in the system. Note that 
these actions are exclusive, i.e. a swimmer can only select one action at a time. 
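The mapping from the five actions to vortex circulations can be sketched as a small lookup. This is an illustrative reading of Eq. (4), not the authors' code: the function names are hypothetical, and the sign convention for the two turning actions (a left turn strengthens the left vortex and weakens the right one) is an assumption chosen to agree with the circulation update of Eq. (6).

```python
import math


def nominal_circulation(ell, v0):
    """Nominal vortex strength for cruise speed v0 and dipole width ell."""
    return 2.0 * math.pi * ell * v0


def circulations(action, gamma0, gamma_a, gamma_t):
    """Return (gamma_left, gamma_right) for an action in {1, 2, 3, 4, 5}."""
    if action == 1:                       # travel straight at nominal speed
        return gamma0, gamma0
    if action == 2:                       # travel straight, slower
        return gamma0 - gamma_a, gamma0 - gamma_a
    if action == 3:                       # travel straight, faster
        return gamma0 + gamma_a, gamma0 + gamma_a
    if action == 4:                       # turn left (assumed sign convention)
        return gamma0 + gamma_t, gamma0 - gamma_t
    if action == 5:                       # turn right (assumed sign convention)
        return gamma0 - gamma_t, gamma0 + gamma_t
    raise ValueError("action must be an integer between 1 and 5")
```

Because the actions are exclusive, a single call per swimmer and per decision interval is enough to set both circulations.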

We emphasize that the use of five actions is a simplification with respect to naturally occurring swimmers, which 
are characterized by a large number of kinematic degrees of freedom and therefore have the ability to fine tune their 
gaits in response to environmental cues. However, a small number of actions drastically reduces the computational 
costs associated with identifying an optimal behavioral policy through reinforcement learning. Hence our choice to 
equip the agents with a limited repertoire of gaits. 


C. The Aoki-Couzin behavioral model with hydrodynamic interactions

We examine how the classical Aoki-Couzin model El El with a priori defined interaction rules would perform in 
the presence of hydrodynamics. In particular, we considered the so called ‘dynamically parallel school’ and ‘highly 
parallel school’ behavioral rules, as detailed in m- 

In the Aoki-Couzin model collective behavior emerges due to three a priori specified rules among agents. Each 
agent tries to avoid collision with neighbors, aligns to the moving direction of the agents contained in a larger 
neighborhood and, finally, tends to approach the agents of an even larger neighborhood. Given these three rules, each 
agent first computes its desired direction and subsequently turns in order to meet it m- By varying the size of the 
interaction regions, qualitatively different behaviors can be observed. In Table 1, we summarize the radii characteristic 
of ‘dynamically parallel school’ and ‘highly parallel school’, as indicated in El and used here. 
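The three rules above can be sketched as follows. This is a simplified illustration of a three-zone model rather than the exact formulation used in the study: it ignores blind angles, noise and turning-rate limits, and the function name, argument layout and tie-breaking choices are assumptions. The radii would be taken from Table 1, for example 1, 6 and 10 for the ‘dynamically parallel school’.

```python
import math


def desired_bearing(position, neighbors, r_rep, r_ori, r_att):
    """Return the desired bearing in radians, or None if no neighbor is sensed.

    neighbors is a list of ((x, y), bearing) pairs; radii define the zones of
    repulsion, orientation and attraction (simplified three-zone sketch).
    """
    rep, ori, att = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    n_rep = n_ori = n_att = 0
    for (x, y), bearing in neighbors:
        dx, dy = x - position[0], y - position[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue
        if dist < r_rep:                  # move away from close neighbors
            rep[0] -= dx / dist
            rep[1] -= dy / dist
            n_rep += 1
        elif dist < r_ori:                # align with neighbors' headings
            ori[0] += math.cos(bearing)
            ori[1] += math.sin(bearing)
            n_ori += 1
        elif dist < r_att:                # approach distant neighbors
            att[0] += dx / dist
            att[1] += dy / dist
            n_att += 1
    if n_rep:                             # repulsion takes priority
        chosen = rep
    elif n_ori and n_att:
        chosen = [0.5 * (ori[0] + att[0]), 0.5 * (ori[1] + att[1])]
    elif n_ori:
        chosen = ori
    elif n_att:
        chosen = att
    else:
        return None
    if math.hypot(chosen[0], chosen[1]) == 0.0:
        return None
    return math.atan2(chosen[1], chosen[0])
```

Changing only the three radii switches between the qualitatively different group behaviors, which is why the model is summarized by a table of zone sizes.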
In order to cast the Aoki-Couzin model into the present dipole framework, each dipole agent determines its desired
direction $\alpha_{\mathrm{desired}}$ by following the specifications of m and then adjusts its vortex circulations as follows

$$
\begin{aligned}
\dot{\alpha}_{\mathrm{desired}} &= \frac{\alpha_{\mathrm{desired}} - \alpha_n}{\tau}, \\
\Gamma_{\mathrm{add}} &\propto (\dot{\alpha}_{\mathrm{desired}} - \dot{\alpha}_n), \\
\Gamma_n^l &= \Gamma_n^l + \Gamma_{\mathrm{add}}, \\
\Gamma_n^r &= \Gamma_n^r - \Gamma_{\mathrm{add}},
\end{aligned}
\tag{6}
$$

where $\tau$ and $\alpha_n$, $\dot{\alpha}_n$ are, respectively, the simulation time step and the agent’s current bearing and bearing rate.
Computing the new $\Gamma_n^l$ and $\Gamma_n^r$ corresponds to adapting gaits to match exactly the desired bearings, assuming the
absence of all other swimmers. Once the circulations of each dipole are determined, the system of $N$ swimmers is
evolved accounting for hydrodynamic coupling via the governing equations.

The behavior of this model in the presence of hydrodynamic interactions is examined in Section III B.


Behavior               Zone of repulsion   Zone of orientation   Zone of attraction
Highly parallel        1                   12                    15
Dynamically parallel   1                   6                     10

Tab. 1: Radii of the zones of interaction for the ‘dynamically parallel school’ and ‘highly parallel school’ behaviors, as detailed
in HU. Radii are normalized by $\ell$.


D. Reverse engineering of dynamic interaction rules via reinforcement learning 


Swimmers are modeled as finite width dipoles with varying circulation strength. Their presence and actions affect
the flow field and, in turn, all other swimmers. Due to this highly non-linear coupling it is virtually impossible to
handcraft local interaction rules that allow the dipoles to coherently swim in any predefined, finite size schooling
arrangement. Therefore, agent based models with a priori defined rules cannot help assess the hydrodynamic properties
of different schooling configurations. In turn we employ a reverse engineering approach to obtain the interaction
rules among swimmers. We specify for the agents the goal of maintaining a given geometric arrangement and employ
a reinforcement learning technique to identify an appropriate interaction policy. This approach relies on four key
components: the reward, which encodes the agent’s goal; the state, which formalizes what the dipole can sense of the
surrounding environment; the actions, that is, the repertoire of gaits at the disposal of the swimmer (Section II B); and
finally a learning strategy based on trial and error.

In this study we employ a particular RL technique, namely the one-step Q-learning algorithm m- Besides its
algorithmic simplicity, Q-learning has been proven to converge to an optimal behavioral policy for finite Markov decision
processes m- In this setting, the swimming agent explores the environment and its experience is represented
by the tuple $(s_n, a_n, s'_n, r_n)$, where $s'_n$ is the next state given the action $a_n$ taken from the current state $s_n$, and
$r_n$ is the corresponding reward. An agent estimates by trial and error the action-value function $Q_n(s_n, a_n)$, the
expected long term reward for taking action $a_n$ given the state $s_n$ (a schematic of this approach is depicted in Fig. 1).
The action-value function can be understood as a table or a matrix in which for every state-action entry the
corresponding expected reward (estimated through the reward history) is stored. This table is consulted by the agent
whenever an action has to be taken, and it is continuously updated as the system evolves. Therefore, $Q_n$ encodes
the swimmer's adaptive decision making intelligence, and the corresponding behavior is determined by choosing from
$Q_n$, with probability $1 - \epsilon$, the best action, i.e. the one attaining $\max_{a_n} Q_n(s_n, a_n)$ from the current state ($\epsilon$-greedy
selection scheme). The $\epsilon$-probability of choosing a non-optimal action allows the agent to explore new state-action
pairs $(s_n, a_n)$ [37]. Therefore, RL intrinsically accounts for noise through $\epsilon$, which can be related to the noise of
natural schooling systems. Here, we use a shared policy approach among all swimmers to accelerate the learning
process, thus $Q_n = Q$ and all agents update $Q$ based on their personal experience. At every learning time interval
$\delta t$, the swimmer updates the action-value function following $Q(s_n, a_n) = Q(s_n, a_n) + \varphi (\Delta Q)_n$ for $n = 1, \dots, N$, where
$(\Delta Q)_n = r_n + \gamma \max_{a_n} Q(s'_n, a_n) - Q(s_n, a_n)$, $0 < \varphi < 1$ is the learning rate, and $0 < \gamma < 1$ is the discount
parameter which corresponds to the weight given to past experiences. We emphasize that learning individual policies,
as opposed to the shared approach employed here, may allow agents finer behavioral tuning. For example, in the case
of schooling, swimmers may adapt their policies depending on their location within the group. However, it has been
empirically shown that the use of a shared policy reduces the time to convergence linearly with the number of agents
m- In our study this entails a hundreds-fold reduction in computational cost, hence the rationale behind the choice
of employing a shared approach. In the following the definitions of reward, state and action are formalized.
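The one-step Q-learning update described above can be sketched with a plain table shared by all agents. This is a minimal illustration rather than the authors' implementation: the function and variable names are hypothetical, actions are indexed 0 to 4 instead of 1 to 5, and the default values of the learning rate and discount parameter are those quoted in the caption of Fig. 5.

```python
N_STATES = 900   # 30 distance bins x 30 orientation bins
N_ACTIONS = 5    # straight, slow, fast, left, right


def make_q_table(n_states=N_STATES, n_actions=N_ACTIONS):
    """Action-value table initialized to zero, shared by all swimmers."""
    return [[0.0] * n_actions for _ in range(n_states)]


def q_update(q_table, s, a, r, s_next, phi=0.01, gamma=0.98):
    """Apply Q(s,a) <- Q(s,a) + phi * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    delta_q = r + gamma * max(q_table[s_next]) - q_table[s][a]
    q_table[s][a] += phi * delta_q
    return q_table[s][a]
```

With a shared policy, every swimmer calls `q_update` on the same table at each learning interval, so one simulated step contributes as many experience tuples as there are agents.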


Reward. Since ultimately we are interested in investigating the hydrodynamic properties associated with different
schooling geometries, we must first have the dipoles learn to swim in a given formation. This is achieved by setting
the goal of a swimmer to follow a specified target point in a predefined arrangement, as depicted in Fig. 2d.
This allows hundreds of self-propelled dipoles to learn how to swim coherently, a task out of reach for models based
on handcrafted a priori interaction rules (in Sections II E and III D we detail how optimal arrangements can be
obtained). The swimmer’s goal is mathematically cast into a numerical reward signal. The numerical reward $r_n$ is
chosen to reflect how well the dipole swimmer can follow its assigned target point while doing the minimum amount
of maneuvering: it is the sum of a distance term weighted by $w_d$, which is largest when the swimmer sits on its target,
and an action term $w_a \eta_a$, where $\eta_a = 0$ for traveling at $v^0$, $\eta_a = -1$ for accelerating and
turning, and $\eta_a = 1$ for decelerating. Weights are set to $w_d = 0.9$ and $w_a = 0.1$, thus $\max(r_n) = 1$. We note how the
second term of $r_n$ penalizes swimmers that take unnecessary actions, while it favors those that reduce their effort by
slowing down.


State. Swimmers can sense their distance, $d_n = |\mathbf{x}_n^t - \mathbf{x}_n|$, and orientation, $\theta_n = \arg(\mathbf{x}_n^t - \mathbf{x}_n) - \alpha_n$, with respect
to their assigned target point $\mathbf{x}_n^t$ within the school, as in Fig. 2d. We wish to stress the fact that the dynamics of
a swimmer is not mapped on a lattice. Swimmers are in fact free to move in the continuum two-dimensional space,
while they adaptively adjust their gait in the attempt to maintain their relative position within the school. The
quantities $d_n$ and $\theta_n$ are each mapped into a set of $L = 30$ discrete states within the range $\Delta d = 10D$ and $\Delta\theta = 2\pi$
such that $s_n = \{\min(L, \max(0, \lfloor d_n L/\Delta d \rfloor)), \min(L, \max(0, \lfloor \theta_n L/\Delta\theta \rfloor))\}$. In total, each swimmer has a state-space
that consists of 900 states. The choice of a target point over the sensing of the neighbors dramatically reduces the
state space dimension (curse of dimensionality), allowing us to computationally tackle the problem. Moreover, it
enables the study of structured schooling arrangements while it still captures the influence of the neighbors as they
directly affect each other’s dynamics through long range hydrodynamic interactions.
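A possible discretization of the sensed distance and orientation is sketched below. It follows the floor-and-clip rule stated above with two assumptions that are not in the text: the angle is first wrapped into $[0, 2\pi)$, and indices are clipped to $L - 1$ rather than $L$ so that each quantity takes exactly $L = 30$ values, giving the quoted 900 states. All names are hypothetical.

```python
import math


def discretize_state(d, theta, width, n_bins=30):
    """Map distance d and relative orientation theta to a pair of bin indices."""
    delta_d = 10.0 * width           # sensing range for the distance (10 D)
    delta_theta = 2.0 * math.pi      # sensing range for the orientation
    theta = theta % delta_theta      # assumption: wrap the angle into [0, 2*pi)
    i_d = int(math.floor(d * n_bins / delta_d))
    i_theta = int(math.floor(theta * n_bins / delta_theta))
    # assumption: clip to n_bins - 1 so that each quantity has n_bins values
    i_d = min(n_bins - 1, max(0, i_d))
    i_theta = min(n_bins - 1, max(0, i_theta))
    return i_d, i_theta


def state_index(i_d, i_theta, n_bins=30):
    """Flatten the pair of bin indices into a single row index of the Q table."""
    return i_d * n_bins + i_theta
```

Distances beyond the sensing range all fall into the last distance bin, so a swimmer that has drifted far from its target still has a well-defined state.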


Actions. Each dipole swimmer can perform five actions. It can travel forward at three distinct speeds, $v^0$ (its
nominal traveling speed), $v^+$, or $v^- = 0.9v^0$, or turn left or right with radius $\rho^T = 10\ell$ while traveling at $v^0$.
These actions are realized by allowing the dipole swimmer to adjust its vortex circulations $\Gamma_n^l$ and $\Gamma_n^r$ accordingly
(Section II B). Since for every state five different gaits are available, the overall state-action space amounts to 4500
entries, to be stored in a table representing the action-value function $Q(s_n, a_n)$.


An example further illustrating the working principles of RL is reported in Section III A.


E. Optimization of schooling patterns via evolution strategies 


In the previous sections we established the algorithmic framework that allows self-propelled swimmers to learn how
to keep their relative position within a school. We now look for optimal swimming configurations according to a
desired metric (cost function) using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [42]. The
CMA-ES has proven effective in a number of fluid mechanics and biological problems, from the optimization of
gait and morphology in swimmers [43-46] to the identification of virus traffic mechanisms [47].

The CMA-ES is a stochastic optimization algorithm that samples at each generation $p$ parameter vectors from a
multivariate Gaussian distribution $\mathcal{N}$. Here each parameter vector encodes the geometric configuration of a schooling
arrangement (Appendix). The covariance matrix of the distribution $\mathcal{N}$ is then adapted based on successful past schools,
chosen according to their corresponding cost function value $f$. In the present context, CMA-ES evolves schooling
configurations based on a metric of swimming effectiveness. In order to evaluate the cost function, i.e. the performance,
of each school geometry, the dipole swimmers must first learn through RL how to swim coherently in that specific
arrangement. Then, after a learning period $\Delta T_{\mathrm{learning}}$, the average cost function is evaluated by simulating the school
for the time $\Delta T_{\mathrm{eval}}$ (Appendix). The value $f$ so computed is then returned to the optimizer, which uses it to select
the best configurations and produce a new, better performing generation of school arrangements, until convergence to
an optimal solution. This process is depicted in Fig. 4, while the definition of the cost function is given in the following.
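The sample, evaluate and select loop can be illustrated with a deliberately simplified evolution strategy. The sketch below is not CMA-ES: it uses an isotropic Gaussian with a fixed step-size decay and no covariance adaptation, and `toy_cost` is a hypothetical stand-in for the expensive evaluation in which swimmers first learn through RL and the school is then simulated. It is only meant to show where the cost function value enters the loop.

```python
import random


def evolve(cost, x0, sigma=0.5, population=8, parents=4, generations=60, seed=0):
    """Simplified (mu, lambda) evolution strategy; NOT a CMA-ES implementation."""
    rng = random.Random(seed)
    mean = list(x0)
    for _ in range(generations):
        # sample a generation of candidate parameter vectors around the mean
        samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                   for _ in range(population)]
        # rank candidates by cost (lower is better) and keep the best ones
        samples.sort(key=cost)
        elite = samples[:parents]
        # recombine: the new mean is the average of the selected candidates
        mean = [sum(column) / parents for column in zip(*elite)]
        sigma *= 0.95  # fixed decay, standing in for step-size adaptation
    return mean


def toy_cost(params):
    """Hypothetical placeholder for the school-effort evaluation."""
    return sum((p - 1.0) ** 2 for p in params)
```

In the setting described above, each call to the cost function would hide a full learning period followed by an evaluation period, which is why the number of candidates per generation matters for the overall cost of the optimization.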


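Cost function. Here the cost function implements a metric of swimming effort for the entire school to be minimized.
We relate the effort of an individual swimmer to its nominal cruise effort, i.e. its additional circulation
expenditure, by defining

$$
\Delta\Gamma_n = \frac{1}{\Delta T_{\mathrm{eval}}} \int_{t}^{t + \Delta T_{\mathrm{eval}}} \Big( \underbrace{|\Gamma_n^l - \Gamma_n^r|}_{\text{turning effort}} + \underbrace{|\Gamma_n^l + \Gamma_n^r - 2\Gamma^0|}_{\text{forward effort}} \Big)\, dt,
\tag{7}
$$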
FIG. 4. Schematic of the CMA-ES optimizer coupled with RL and the low-order model. CMA-ES dispatches $k = 1, \dots, p$ parameter
sets $\mathbf{p}_k$ defining the school configuration. The RL framework allows agents to learn to swim in the formation characterized by
$\mathbf{p}_k$. In return, CMA-ES receives a fitness $f_k$ that captures the effectiveness of collective swimming relative to each school $\mathbf{p}_k$.


where $\Gamma^0$ is the nominal strength of each vortex. Thus, the cost function for the entire collection of swimmers is
defined as $f = \sum_{n=1}^{N} \Delta\Gamma_n$. The change in circulation can be associated with the production of vorticity involved
in accelerations or turning maneuvers of the swimmer, and therefore with the swimming effort. We also note as reference
that $f = 0$ corresponds to cruise swimming of isolated dipoles.
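As an illustration of how such an effort metric could be accumulated from a simulation, the sketch below time-averages the departure of the two circulations from their nominal value. The integrand used here (absolute turning imbalance plus absolute forward excess) and the absence of any further normalization are assumptions made for the example; the exact form should be taken from Eq. (7). Function names and the piecewise-constant sampling are also hypothetical.

```python
def swimmer_effort(history, gamma0, dt):
    """Time-averaged additional circulation of one swimmer.

    history is a list of (gamma_left, gamma_right) pairs, each held for dt.
    Assumed integrand: |gl - gr| (turning) + |gl + gr - 2*gamma0| (forward).
    """
    total = 0.0
    for gamma_l, gamma_r in history:
        turning = abs(gamma_l - gamma_r)
        forward = abs(gamma_l + gamma_r - 2.0 * gamma0)
        total += (turning + forward) * dt
    return total / (len(history) * dt)


def school_cost(histories, gamma0, dt):
    """Sum of the individual efforts over all swimmers in the school."""
    return sum(swimmer_effort(h, gamma0, dt) for h in histories)
```

Under these assumptions a swimmer that only ever cruises at its nominal strength contributes zero, in line with the reference value quoted above for isolated dipoles.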


III. RESULTS 


A. Learning process for an individual dipole swimmer 


Despite the formalism, the working principles of RL are rather simple. We illustrate them here with the aid of a 
simple but representative problem, before proceeding further. 


We consider a dipole whose goal is to follow a prescribed trajectory. The trajectory is represented as a target point
that moves by alternating straight runs with random turns of fixed radius. The dipole is aware at all times of
its own bearing $\alpha$ and position $\mathbf{x}$ as well as of the target position $\mathbf{x}^t$. At regular intervals $\delta t$, the agent is faced with
the problem of choosing whether to turn left or right, accelerate, slow down or keep straight in order to accurately
follow the target point. The dipole has not been instructed how to act given its relative position to the target, i.e.
no a priori local rules are enforced. Instead, at every interval the swimmer estimates its relative distance $d$ and orientation
$\theta$ with respect to $\mathbf{x}^t$, as illustrated in Fig. 2d. This is equivalent to determining the state $s$, i.e. the agent’s current
situation. Since the intelligence or behavior of the swimmer is encoded as a multi-dimensional table or matrix, the
continuous values of $d$ and $\theta$ are discretized into a number of integer values, as described in Section II D. Once the
state matrix entry is determined, the agent can consult the expected rewards stored in the matrix that are associated
with taking each of the five aforementioned actions $a$. These values $Q(s, a)$ are constantly updated by the dipole and
represent its past experience, and initially they are all set to zero. At this point the agent chooses with probability
$1 - \epsilon$ the best action, i.e. the one with the largest $Q(s, a)$ value, and after pursuing it, the new distance and orientation $(d', \theta')$ with respect to
the target are estimated, defining the new state $s'$. Moreover, based on $d'$ the reward $r$ is assigned to the dipole. The
policy is then improved by discounting the old estimate of the expected reward $Q(s, a)$ and complementing it with
the new information $r$ according to the update rule described in Section II D. This process is repeated indefinitely
until convergence to an optimal behavioral policy.
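The action choice at each decision interval can be sketched as an $\epsilon$-greedy lookup in the row of the table that corresponds to the current state. The function name is hypothetical, actions are indexed from 0, and ties are broken by taking the first maximal entry, which is an arbitrary choice made for this example.

```python
import random


def select_action(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))   # exploration
    return q_row.index(max(q_row))         # exploitation (first maximum)
```

A typical call would be `select_action(q_table[s], 0.01, random.Random(0))`, with the exploration probability set to the value quoted in the caption of Fig. 5. Since all entries start at zero, early decisions are effectively arbitrary and the behavior only sharpens as rewards accumulate in the table.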


The evolution of the swimmer’s reward over time is given in Fig. 5a, and examples of the agent trajectories during
the learning process are given in Fig. 5b-d. The process of improving the policy is seen in Fig. 5a, where after an
initial transient, the agent progressively learns how to maximize its numerical reward by accurately following $\mathbf{x}^t$. In
Fig. 5b, the swimmer fails rather quickly, as its path diverges from the target path at the first
turn. Subsequently, in Fig. 5c the swimmer follows adequately for a longer time before failing, and in Fig. 5d the
swimmer learns a policy that allows it to follow a pseudo-random path indefinitely.
FIG. 5. A single agent learns to optimally follow a moving target point $\mathbf{x}^t$. Given in (a) is the time evolution of the agent’s
reward normalized by its maximum attainable value, based on the reward definition given in the main text. Panels (b), (c),
and (d) correspond, respectively, to a non-adaptive, intermediate-adaptive, and well-adaptive learning stage. The starting times
of panels (b-d) are marked on the x-axis of (a). Black dashed and solid red lines correspond, respectively, to the trajectories
of the agent (green triangle) and the target point (red dot). Instantaneous streamlines (blue lines) are given as reference.
Simulations are run in the domain [0,1] x [0,1] with i = 3 x 10“^, = 5^, = 10^, St = 0.1, ip — 0.01, 7 = 0.98, 

and $\epsilon = 0.01$. Notation defined in Section II and Appendix.


B. Classical agent based models versus learning agents in the presence of hydrodynamics 


We first show that prescribed schooling patterns, including diamonds and squares that have been proposed as
favorable schooling patterns [7], are not robustly maintained without an adaptive dynamic response of the swimmers
to the flow field. In Fig. 6 we report the results of sixteen swimming agents attempting to school in several formations
initialized ($t = 0$) as shown in Figs. 6a,d,g. These initial patterns are characterized by diamond-like, square-like
and random arrangements. With a pre-specified forward swimming gait, the relative swimmer locations will result in
varying hydrodynamic loads, thus implying a dynamic rearrangement of the swimmers. Indeed, when the agents are
assigned a specified swimming configuration, the simulated swimmers diverge from their relative positions and are
prone to collisions with their neighbors due to flow-mediated interactions, consistent with [36]. As shown in Fig. 6,
at $t = 80$ no collisions occur in the diamond-like configuration, but the swimmers are substantially strained apart.
The square arrangement causes all agents to collide, while random configurations lead to straining and
collision effects simultaneously (Fig. 6). A qualitative hydrodynamic explanation for the disruption of square-like
and diamond-like formations is provided in Section III D and the accompanying figure. In turn, in Fig. 6 we show that schooling
patterns can be maintained by swimmers through an adaptive modification of their swimming gaits that accounts
for hydrodynamic interactions using a reinforcement learning algorithm. Therefore, through RL the agents learn to
adjust their swimming gaits to compensate for the varying hydrodynamic loads typical of a liquid environment. This
can be related to the noisy environments and the corresponding response of swimmers in natural schooling systems.

We note that while the stability of diamond and square configurations has also been investigated by Tsang and 
Kanso [29], our approach differs fundamentally. In fact, Tsang and Kanso consider infinite, doubly periodic lattices of 
dipoles characterized by a single gait. The dipoles are then cleverly arranged so that the resulting flow field passively 
stabilizes the lattice, removing altogether the need to respond to varying loads by adapting swimming gaits, and 
the associated energetic costs. Here, we abandon the assumption of infinite schools in favor of a more realistic 



FIG. 6 . (a,d,g) Initial/desired, (b,e,h) non-adaptive, and (c,f,i) adaptive (with policy learned from RL) swimming configurations 
for 16 dipole swimmers at specified times. Red dipole swimmers have experienced a collision (at which point they no longer 
move). Instantaneous streamlines (blue lines) are given as reference. Dipole swimmers are initialized on a diamond lattice in 
(a,b,c), a square lattice in (d,e,f) and randomly in a circular region of radius 17.5ℓ in (g,h,i). In all initial configurations 
a minimum inter-dipole spacing of 10ℓ is enforced. (l,m) Time evolution of a school of 100 agents obeying ‘dynamic parallel 
group’ (1) and ‘highly parallel group’ (m) models [TT] enhanced with hydrodynamic interactions. Simulations are run in the 
domain [0,1] x [0,1] with ^ = 5 x 10“^, = 5^, = 0.1r'°, = 10^, 6t = 0.1, p — 0.01, 7 = 0.98, e = 0.01. Notation defined 

in Section II and Appendix. 


description. As a consequence passive stabilization due to a stationary global flow field is no longer an option, hence 
the introduction of varying gaits and adaptive decision making. Therefore, our approach complements the results 
of [29], allowing us to study the hydrodynamic and energetic features associated with arbitrarily shaped, finite size 
schools. 


We also investigated the dynamics of the Aoki-Couzin behavioral model El El in the presence of hydrodynamic 
interactions (Section II C). This model relies on a priori handcrafted local interaction rules among agents and does 
not explicitly account for the flow environment during the decision making process. The time evolution of a school 
of 100 agents obeying ‘dynamic parallel group’ and ‘highly parallel group’ behavioral rules El is shown in Fig. 6l,m. 

We find that the swimmers experience substantial straining and collisions, due to the hydrodynamic coupling. This 


FIG. 7. We compare the quantitative response of the Aoki-Couzin model with and without hydrodynamic interactions in 
terms of the number of agent collisions over time. A collision is detected whenever the distance between two agents is found to be 
less than 2ℓ. Red and blue correspond, respectively, to ‘dynamic parallel group’ and ‘highly parallel group’ behavioral rules 
m- Blank and solid circles refer, respectively, to simulations with and without hydrodynamics. 


behavior is a drastic departure from schooling patterns observed when employing the original models m- Indeed, the 
number of collisions increases by 40% and 700% in, respectively, the ‘dynamic parallel group’ and ‘highly parallel 
group’ models in the presence of hydrodynamics (Fig. 7). These findings emphasize the role of the environment, 
especially in a hydrodynamic setting in which all agents are doubly connected through the flow. The fact that the 
non-linear response of the hydrodynamic system cannot be anticipated renders the definition of interaction rules by 
hand cumbersome, tedious, and ultimately not robust. 
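
The collision count used in this comparison can be illustrated with a minimal Python sketch. The function name `count_collisions`, the array layout and the choice to pass the detection distance as a free `threshold` argument are assumptions made for illustration; this is not the code used in the study.

```python
import numpy as np

def count_collisions(positions, threshold):
    """Count unordered agent pairs whose separation is below `threshold`.

    positions: (N, 2) array of agent coordinates (hypothetical layout).
    threshold: detection distance, in the same units as `positions`.
    """
    positions = np.asarray(positions, dtype=float)
    # Pairwise separation vectors and distances between all agents.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep each pair once (strict upper triangle, no self-pairs).
    i, j = np.triu_indices(len(positions), k=1)
    return int(np.sum(dist[i, j] < threshold))
```

Evaluating such a count at every time step, with and without flow coupling, would give curves of the kind compared in Fig. 7.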

Agent based models with a priori specified rules, such as the ones considered herein, are characterized by a 
large parameter space (size of each zone, attraction and repulsion weights, time step, etc.) and their results are known 
to be sensitive to these settings. In this study we have not explored the full parameter space and as such we cannot 
rule out the possibility that particular parameter combinations may allow dipole swimmers to maintain structured 
arrangements or to exhibit robust schooling dynamics. Nevertheless, the present investigation raises two key issues. 
Firstly, the introduction of the flow environment modifies the dynamics associated with classical agent based model 
settings. This implies that to reproduce the behavior observed for a given instance of the Aoki-Couzin model, a new 
set of parameters has to be discovered. Since zone sizes and attractive and repulsive forces possess a well defined ‘social’ 
meaning, the presence of the fluid affects these quantities and alters the nature of social interactions. Therefore, 
the characterization of social traits cannot disregard the environment. Secondly, the sensitivity 
of classical models to parameter settings supports the need for rigorous, automatic procedures for the identification 
of local rules in the context of collective behavior. Indeed, we were unable to handcraft, or derive through a direct 
optimisation process, any parameter set that enabled dipole swimmers to maintain a structured schooling formation. 
However, this was readily achieved through the RL framework. We reiterate that this finding does not represent a 
mathematical proof that handcrafted a priori models do not lead robustly to schooling, but it strongly emphasizes 
the need for computational methods that robustly guide the systematic exploration of their parameter space. 


C. Optimal internal structure of a school of dipole swimmers 


Since diamond lattices embedded in an infinite school have been suggested to be energetically favorable [ZIES] , we 
first characterize in terms of collective effort only the internal structure of schooling formations, disregarding edge 
effects related to the finite size of the school. The swimming effort of an individual is quantified by the variation of 
its dipole strength from a nominal value (ΔΓ, Section II E). This effort changes during swimming, according to the 
reinforcement learning algorithm, to overcome hydrodynamic noise. We therefore optimize the bulk school structure 
to minimise the cost function f, defined as the linear sum of the efforts of all the swimmers in the group (Section II E), 
with f = 0 corresponding to cruise swimming of isolated swimmers. We consider three different parameterizations 
for the generalized internal structure: (a) diamond, (b) rectangle, and (c) hexagon configurations, as depicted in Fig. 8a-c. 
The parameter β defines the angle between the swimming direction and the axis defining the lattice structure (see the 
insets of Figs. 8a-c). The hexagon formations are a subset of the diamond formations, with the added restriction that 
each swimmer be equidistant from its nearest neighbors. To minimize edge effects, we generate a circular shape and fill 
it with a lattice of N ≈ 200 agents. Collective effort is evaluated only for the interior agents, which we select as 
the 50 agents closest to the center of the school. 
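
A minimal Python sketch of this construction is given below for a rotated rectangular lattice clipped to a disc, followed by the selection of the innermost agents. The function names, the spacing arguments `dx` and `dy`, and the use of the centroid as the school center are assumptions made for illustration and do not reproduce the exact parameterizations of Fig. 8.

```python
import numpy as np

def lattice_in_circle(dx, dy, beta, radius):
    """Rectangular lattice with spacings (dx, dy), rotated by beta, clipped to a disc."""
    # Enough lattice cells in each direction to cover the disc after rotation.
    n = int(np.ceil(radius / min(dx, dy))) + 1
    ix, iy = np.meshgrid(np.arange(-n, n + 1), np.arange(-n, n + 1))
    pts = np.column_stack([ix.ravel() * dx, iy.ravel() * dy])
    # Rotate the lattice axis by beta with respect to the swimming direction.
    c, s = np.cos(beta), np.sin(beta)
    pts = pts @ np.array([[c, -s], [s, c]]).T
    return pts[np.linalg.norm(pts, axis=1) <= radius]

def interior_agents(points, n_interior=50):
    """Indices of the n_interior agents closest to the school centroid."""
    center = points.mean(axis=0)
    order = np.argsort(np.linalg.norm(points - center, axis=1))
    return order[:n_interior]
```

With unit spacings and a radius of 8, for instance, the disc holds roughly 200 lattice sites, of which only the 50 returned by `interior_agents` would enter the effort evaluation.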

The starting configurations are given in Fig. 8d-f, while the corresponding optimal solutions are reported in Fig. 8g-i. 
The swimmers form striated patterns and get closely packed to one another in their traveling direction while separating 
as far as possible (given the bounds of the optimization search space) in the orthogonal direction. The packing in 
the direction of travel is limited by the capacity of the agents to stay in formation due to the strong flow-mediated 
interactions. 

For the hexagon case (each swimmer equidistant from its neighbors), from a starting configuration that is slightly detrimental for 
the school (f_i = 0.001), the optimizer finds a configuration that gives no added benefit to the collective, as the fitness 
of the case in Fig. 8i is f_best ≈ 0. It is concluded that the constraint of equidistant swimmers does not allow the agents to 
pack in the direction of travel and thus is not favorable for reducing circulation expenditure. 

The results reported in this section may be put in context in light of recent experiments that systematically 
investigated the thrust, power and efficiency performance of two side-by-side [48] and in-line pitching airfoils [49] . 
In the side-by-side case [48], it is found that the performances of the individual airfoils are always anti-correlated except 
for perfectly in-phase or out-of-phase actuation (which may not be realistic or robust in a schooling system). This 
arrangement entails an overall constant system efficiency and the generation of a net torque, which needs to be 
compensated for in stable schooling arrays. In the context of our study, these results suggest no foreseeable benefits 
associated with parallel swimming dipoles. Indeed, the identified striated patterns tend to minimize anti-correlation 
effects, by stretching lateral spacings as much as possible, effectively decoupling parallel dipoles. The in-line pitching 
airfoils case is more complex. For small spacings s/ℓ < 0.5 (where s is the linear coordinate in the direction of 
travel and ℓ is the airfoil chord) the performances of the leading and trailing airfoils are anti-correlated. For larger 
spacings instead, only the trailing body is affected and the vortex shedding from the leading airfoil has a prominent 
role in the observed dynamics. Extrapolating to multiple airfoils we may expect small spacings to be suboptimal. In 
fact, due to the strong anti-correlation every airfoil would experience both enhancing and disruptive effects. A larger 
spacing instead would allow, under the appropriate phase lag, all airfoils to benefit from flow coupling. The dipole 
model does not account explicitly for vortex shedding. As a consequence there is no phase lag or cutoff separation 
distance that controls the interactions of the leading and trailing dipoles. Nevertheless, the optimization process 
discards configurations characterized by small spacing since the strong anti-correlated dipole interaction poses control 
and learning problems and increases the swimming effort. Dipoles settle for larger spacings (s/ℓ > 2.55 in Fig. 8g, 
s/ℓ > 1.35 in Fig. 9c) to weaken anti-correlation effects, allowing for better stability and reduced effort, consistent 
with the above observations. We conclude that our findings, within the limits of our modeling approach, qualitatively 
capture the salient features associated with two-body swimming interactions [48, 49]. 


D. Optimal shape of a school of dipole swimmers 

We proceed by optimizing the collective swimming effort f of the school configuration as a whole, where both 
internal structure and edge effects compete simultaneously. We initialise a school formation by arranging N = 100 
swimmers within a prescribed shape so as to maximize the distance from each other as well as from the shape boundary 
(Appendix). The schooling arrangement is specified by four parameters p = {d_avg, k_1, k_2, φ_1}, where k_1, k_2 and φ_1 
characterize its shape, while d_avg determines its area A = N d_avg^2. The parameters corresponding to the optimal schooling 
arrangement are identified through stochastic optimization (Section II E) that minimises f, with f = 0 corresponding 
to cruise swimming of isolated swimmers. 

The course of the stochastic optimization is shown in Fig. 9a. The schooling arrangement evolves from the initial 
circular shape of Fig. 9b to the optimal ‘hourglass’ solution of Fig. 9c, reminiscent of shapes observed in nature for 
medium size schools [sniEi]. The ‘hourglass’ shape is associated with an ≈ 80% area contraction and with a 10-fold 
reduction in collective effort. The color coding in Fig. 9b,c signifies the average swimming effort required by the dipoles 
to maintain their position in the school and illustrates how the circulation strength decreases for all swimmers, due 
to the more favorable arrangement. 

The streamlines of the collective flow fields are illustrated in Fig. 9d-f. The swimmers in the ‘hourglass’ school are 
found to align in striated patterns (Fig. 9f), consistent with the findings of Fig. 8, and synchronize to induce a stronger 
global dipolar field (Fig. 9d,e), which is associated with a higher streamline density in the direction of travel and 
implies a stronger forward velocity. This formation allows swimmers to maintain their forward speed with a reduced 
individual effort (smaller circulation strength). Furthermore, dipoles in the center of the necking region (Fig. 9c,e) 
get the benefit of drafting from the swimmers in the front while at the same time they are pushed by the ones in 
the rear. In summary, packing to a smaller area while elongating the school shape favors striated swimming patterns 
FIG. 8. The three different parameterizations for the internal structure of the school. We choose to investigate (a) diamond 
configurations parameterized by p = {5,(b) rectangular configurations parameterized by p = { 6 ,/i,/3}, and (c) hexagon 
configurations parameterized by p = /d}. In (c), all agents are equidistant to its nearest neighbors. The angle /3 represents 

the difference in direction of travel with respect to the axis defining the structure. Panels (d-f) show the initial guesses, for which the fitnesses are 
f ≈ 0, and panels (g-i) the best optimized solutions for the parameters defining the internal structure according to (a-c). 
Population size is p = 100. Streamlines are given in blue. Optimal parameters are (g) p_best = {49.70ℓ, 2.55ℓ, 0.02}, (h) 
p_best = {45.90ℓ, 5.20ℓ, -0.03}, and (i) p_best = {37.25ℓ, 0.52}, corresponding to the fitnesses f_best = -0.175, -0.184, and 0.000. 
Swimmers tend to form striated patterns in diamond and rectangular configurations, whereas in the case where dipole swimmers 
are required to be equidistant from one another, there is no apparent hydrodynamic benefit from staying in the collective. 
Simulations are run in a [0,1] x [0,1] box with ^ = 5 x 10“^ = 1 x 10“^),i’° = 5^, = 10^, St = 0.1, p — 0.01, 7 = 0.98, 

ε = 0.01. Notation is defined in Section II and the Appendix. We note that the model relies on the assumption that swimmers are 
separated by more than one characteristic length ℓ from one another [36]. This condition is met here with a minimum distance of 
2.55ℓ for case (g), corresponding to the spacing h in our parameterization. This implies that swimmers could pack even tighter, 
but this scenario is found to be suboptimal. 


(Fig. 9f) so that swimmers benefit from flow-mediated interactions. We note that the optimal swimming pattern 
exhibits a swimming effort that outperforms that of the Aoki-Couzin models by 190% (f_a priori models = 0.06). 

Moreover, specified square-like and diamond-like formations are also shown to be detrimental in terms of collective 
and individual effort (Fig. 10); nevertheless, their analysis reveals the hydrodynamic mechanisms at play in 
the ‘hourglass’ optimal solution. Indeed, although the fitness for the entire school is f = 0.004 for both Fig. 10a 
and Fig. 10b, swimmers exposed on the left and right edges of the diamond formation suffer from high circulation 
expenditures, while those in the square one in general do not. In the square formation, swimmers line up near the left 
and right edges of the school to help create greater net flow in the direction of travel. Conversely, swimmers in the 
front or the rear of the diamond school benefit from schooling, while those in the square formation do not, due to the 
counter flow produced by their immediate left and right neighbors. The optimal school shape solution of Fig. 10c 
takes advantage of these two effects. The ‘hourglass’ shape allows agents to line up near the boundary of the school 
and elongates in the traveling direction to reduce the counter flow from agents on the right and left edges. 


IV. DISCUSSION 

We present a novel approach for studying schooling by coupling reinforcement learning and stochastic optimisation 
with swimming agents [36] that explicitly account for hydrodynamic interactions. Swimmers are modeled as self- 
propelled, finite width dipoles and have the capability to adjust their speed and orientation to cope with varying 
hydrodynamic loads. The dipoles represent the far field vortex wake of self-propelled natural swimmers, so that 
agents impart long range velocity fields on all other agents via their dipolar strengths. These non-linear hydrodynamic 
interactions critically affect schooling dynamics and the individual swimming patterns. We show that classical agent 
models, which rely on a priori specified social rules, are not robust in the presence of hydrodynamics. In order 
to compensate for hydrodynamics and to allow for schooling in the present agent based model, we reverse engineer 



FIG. 9. (a) Evolution of the cost function f versus the number of cost function evaluations. Population size per generation is p = 100. 
Blue and green lines correspond, respectively, to the best solution in the current generation and the best solution ever. Also given are 
school formations corresponding, respectively, to the (b) starting search point (p_i = {3, 1, 1, 1.57}, f_i = -0.005) and (c) best 
point (p_best = {1.35, 0.5, 1.98, 2.42}, f_best = -0.067). The additional circulation change is computed for each swimmer and 
colored accordingly. Simulations are run in the domain [0,1] x [0,1] with ^ = 5 x 10“^, = bi, = 10^, 6t = 0.1, p = 0.01, 

γ = 0.98, ε = 0.01. All configurations correspond to a simulation time t = 100. Notation is defined in Section II and the Appendix. 
Streamlines of (d) the starting search point and (e) the best solution configuration. Given in (f) is a detailed view of the dashed 
box of (e) with the dipole swimmers represented as black dots. As shown in (f), it is advantageous for the dipole swimmers to 
line up in striated patterns to aid in creating a favorable flow in the direction of travel. Indeed, the higher streamline density 
in this direction is the footprint of a stronger longitudinal velocity field. We note that the dipole model relies on the assumption 
that swimmers are separated by more than one characteristic length ℓ from one another [36]. This condition is met here with a 
minimum distance of 1.35ℓ (first entry of the parameter vector p_best). This implies that swimmers could pack even tighter, but 
this scenario is found to be suboptimal. 
FIG. 10. Comparison between (a) square formation (f = 0.004), (b) diamond formation (f = 0.004) and (c) the optimal solution 
(f_best = -0.067). The additional circulation change is computed for each swimmer and colored accordingly. Simulations are 
run in the domain [0,1] x [0,1] with ^ = 5 x 10“^, = 5^, = 10^, 3t = 0.1, p = 0.01, 7 = 0.98, e = 0.01. Notation defined 

in Section II and Appendix. 
the rules that are followed by the swimmers. This reverse engineering is achieved through a reinforcement learning 
algorithm, that creates mappings between the dynamic environment of the agents and their actions so as to maximize 
a numerical reward. Our approach differs from the widely used handcrafted a priori behavioral rules, and allows 
us to examine how hydrodynamics affects swimmers’ decision-making in schooling. We find that adaptive swimming 
policies are crucial for maintaining schooling formations. 

We evaluate the effectiveness of various formation patterns through an evolution strategy algorithm that identifies 
optimal schooling shapes and swimmer arrangements. We find that schools exhibiting minimal collective effort are 
‘hourglass’ shaped and elongated in the swimming direction. Elongated shapes allow for drafting and pushing of 
swimmers arranged in internal striated patterns. Such internal striated patterns are found to be optimal independently 
of the overall shape of the school, qualitatively consistent with experimental observations [481451] . 

Moreover, it is found that a tight packing of swimmers inside the school allows them to exploit flow-mediated 
interactions in terms of collective effort by enhancing the global dipolar field. At the same time there is a limit to 
the amount of packing, as it becomes increasingly difficult for the swimmers to stay in formation, due to stronger 
flow-mediated effects and increased probability of collisions. Such flow-mediated interactions can help explain how 
certain fish travel in dense, elongated packs when migrating and foraging. 

We wish to note that the present reverse engineering approach for the automatic identification of interaction rules 
relative to a goal can be readily generalized to other forms of collective behaviors, from car traffic to social aggregations. 
In the context of schooling, future work is concerned with extending the use of present learning and optimization 
techniques to two- and three-dimensional viscous flows of multiple swimmers at intermediate Reynolds numbers [52] . 
Reverse engineering techniques, such as the ones proposed herein, can then be used to identify the various evolutionary 
traits that may have led to fish schooling. 


APPENDIX 


A. Learning optimal behavior in a fluid mediated environment 


1. Time integration and handling of collisions 


The set of ODEs given in Eqs. ( 2apb ) in the main text is numerically integrated using the forward Euler method 
so as to maintain flexibility with the decision selection process of the reinforcement learning algorithm. Since the 
equations become stiff as swimmers approach one another, the time step dt is computed according to the minimum 
distance dmin between all dipole swimmers present in the environment. We bound dt with dt = dtmax if dmin > 2ℓ 
and dt = dtmin = 5e“^ if dmin < ℓ/2. We impose dtmax to be dependent on the nominal velocity such that 
dtmax = Eor the simulations given here, C = 0.005. 
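
The time step selection and the forward Euler update described above can be sketched in Python as follows. The two bounds follow the text; the linear blend between them for intermediate distances, the function names, and the treatment of `dt_min` and `dt_max` as free arguments are assumptions made for illustration, since the text does not specify them here.

```python
def adaptive_time_step(d_min, ell, dt_min, dt_max):
    """Choose dt from the minimum inter-dipole distance d_min.

    Uses dt_max when d_min > 2*ell and dt_min when d_min < ell/2.
    The linear blend in between is an assumption of this sketch.
    """
    if d_min > 2.0 * ell:
        return dt_max
    if d_min < 0.5 * ell:
        return dt_min
    weight = (d_min - 0.5 * ell) / (1.5 * ell)
    return dt_min + weight * (dt_max - dt_min)

def euler_step(state, rhs, dt):
    """One forward Euler update of state' = rhs(state)."""
    return state + dt * rhs(state)
```

Recomputing `dt` before every `euler_step` keeps the update explicit, so that an action chosen by the learning algorithm can be applied at each step.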

Dipoles tend to collide if they are in close proximity with another dipole [36]. In a collision of two dipoles, each 
dipole stops and its circulation strength is annihilated. We manage this phenomenon by labeling colliding 
dipoles as ‘dead’ if dt = dtmin. The dipoles’ circulations are artificially set to zero; hence, dead dipoles no longer 
influence the other swimmers in the flow. 


2. Error analysis of non-adaptive and adaptive schools of dipole swimmers 

As seen in Fig. 6 of the main text and in Fig. 11 here, in the presence of hydrodynamic interactions a given 
schooling arrangement is not maintained robustly without an adaptive dynamic response to the flow. We quantitatively 
demonstrate that schooling patterns can be maintained by swimmers through reinforcement learning. This is 
illustrated in Fig. 11, in which the errors (average distance to the target points, e = d_n/ℓ) of the simulations 
of Fig. 6 are compared. 
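
As a minimal illustration of this error measure, the Python sketch below computes the average agent-to-target distance normalized by the characteristic length. The function name `formation_error` and the array layout are assumptions made for illustration.

```python
import numpy as np

def formation_error(positions, targets, ell):
    """Average distance between agents and their assigned targets, in units of ell.

    positions, targets: (N, 2) arrays with matching row order (hypothetical layout).
    """
    positions = np.asarray(positions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    distances = np.linalg.norm(positions - targets, axis=1)
    return distances.mean() / ell
```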


B. Optimal schooling formations 

We seek optimal schooling configurations by combining RL with an evolutionary strategy. The algorithm of choice 
is the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) in its multi-host, rank-μ and weighted recombination 
form nil [53]. The robustness of CMA-ES is mainly controlled by the population size p [42]. In this work, 
as a tradeoff between robustness and fast convergence, we set p = 100 for all optimization campaigns. Bounds of the 
search space are enforced during the sampling through a rejection algorithm. 
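
Enforcing bounds by rejection during sampling can be sketched as follows in Python. This is only the sampling step, drawn from a fixed Gaussian, and not the CMA-ES update itself; the function name, the `max_tries` safeguard, and the mean and covariance used in the usage note are assumptions made for illustration.

```python
import numpy as np

def sample_within_bounds(rng, mean, cov, lower, upper, max_tries=10000):
    """Draw one Gaussian sample, redrawing until it lies inside the box [lower, upper]."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    for _ in range(max_tries):
        x = rng.multivariate_normal(mean, cov)
        if np.all(x >= lower) and np.all(x <= upper):
            return x  # accepted: candidate is feasible
    raise RuntimeError("no feasible sample found within max_tries")
```

For example, with `rng = np.random.default_rng(0)`, a mean at the center of the search box and a small diagonal covariance, most draws are accepted at the first attempt, and rejections only become frequent when the distribution drifts towards a boundary.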
FIG. 11. The time evolution of the sum of all errors between all agents and their respective assigned target points. The 
non-adaptive school simulations (dotted lines) of diamond (red), square (blue), and random (black) formations correspond to 
Fig. 6b,e,h of the main text, respectively. The errors in the adaptive school simulations (bottom set of solid lines) of diamond (red), 
square (blue), and random (black) formations correspond to Fig. 6c,f,i of the main text, respectively. The error is defined as 
e = d_n/ℓ, i.e. the average distance to the target points. 


In this strategy CMA-ES determines the optimal configuration based on the metric of swimming effectiveness f, 
while RL determines the optimal policy for an agent to follow its target point under any configuration requested 
by CMA-ES. Therefore, every cost function evaluation entails CMA-ES dispatching a parameter set defining the 
geometry of the school, then an RL training period (ΔT_training = 10000) that allows the dipoles to learn how to swim in 
the given arrangement, followed by an evaluation interval (ΔT_eval = 100) in which the school effectiveness is measured 
and returned to CMA-ES. 
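
The structure of a single cost function evaluation can be sketched in Python as below. The callables `build_school`, `train_policy` and `measure_effort` are hypothetical placeholders for the geometry construction, the RL training and the effort measurement, and `optimize` is a toy stand-in for the outer loop, not CMA-ES itself.

```python
def evaluate_configuration(params, build_school, train_policy, measure_effort,
                           n_training=10000, n_eval=100):
    """One cost function evaluation as seen by the outer optimizer."""
    school = build_school(params)                   # geometry dispatched by the optimizer
    policy = train_policy(school, n_training)       # RL training period
    return measure_effort(school, policy, n_eval)   # effort returned to the optimizer

def optimize(candidates, evaluate):
    """Toy outer loop: evaluate each candidate and keep the one with the lowest cost."""
    costs = [evaluate(p) for p in candidates]
    best = min(range(len(candidates)), key=costs.__getitem__)
    return candidates[best], costs[best]
```

Keeping training inside the evaluation means that every candidate geometry is scored with a policy adapted to that geometry, at the price of one full training period per evaluation.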


1. CMA-ES settings and parameterization for the optimization of school shape 


We evaluate how the overall school shape affects collective effort and optimize its effectiveness in terms of Eq. 0 
of the main text. To do so, we create a school by designing a general external shape and arranging the swimmers 
inside the boundary by placing them maximally distant from one another and the boundary. The shape of the school 
is dictated by a cubic spline-based parameterization introduced in [54]. According to this parameterization, the external 
school shape reads as c = S(φ), where c is the radial distance from the origin in the polar plane, φ ∈ [0, π] is the 
angle, and S is the piecewise polynomial of the cubic spline. The spline control points (red dots in Fig. 12), expressed 
in polar coordinates, are (c_0, 0), (c_1, φ_1) and (c_2, π), with the radii defined as c_1 = k_1 · c_0 and c_2 = k_2 · c_0, with k_1 and 
k_2 being constants. The school shape is completed by mirroring the obtained spline profile. Unlike [54], the area of 
the school shape A(S), and therefore c_0, is controlled through an extra parameter, namely the average distance d_avg 
between swimmers, with c_0 chosen such that A(S) = N · d_avg^2, where N is the number of dipoles. Therefore, the shape of the 
school relies on the four parameters p = {d_avg, k_1, k_2, φ_1}. These free parameters are varied within the search space 
[1, 5] × [0.1, 2] × [0.1, 2] × [π/6, 5π/6] during the optimization. 


2. CMA-ES settings for the optimization of internal lattice structure of a school 

The free parameters h, δ, and β for the diamond, rectangular, and hexagonal search spaces (see Fig. 8a-c of the 
main text) range over [0, 50ℓ] × [0, 50ℓ] × [0, π/2], [0, 50ℓ] × [0, 50ℓ] × [0, π/2], and [0, 50ℓ] × [0, π/6], 
FIG. 12. Parametrization of the shape of the school and a representative subset of candidate solutions. The parameterization 
creates the solid line (blue) and is mirrored to create the symmetric complement (dashed line). The control points are depicted 
as red dots. 


respectively, during the optimization. 